Feasibility of Learning

Roadmap
1. When Can Machines Learn?

   Lecture 3: Types of Learning
   focus: binary classification or regression from a
   batch of supervised data with concrete features

   Lecture 4: Feasibility of Learning
   • Learning is Impossible?
   • Probability to the Rescue
   • Connection to Learning
   • Connection to Real Learning

2. Why Can Machines Learn?
3. How Can Machines Learn?
4. How Can Machines Learn Better?
Learning is Impossible?

A Learning Puzzle

[figure: example patterns labeled $y_n = -1$ and $y_n = +1$, and a new pattern with $g(\mathbf{x}) = ?$]

let’s test your ‘human learning’
with 6 examples :-)
Two Controversial Answers
whatever you say about $g(\mathbf{x})$, ...

[figure: the learning puzzle patterns]

truth $f(\mathbf{x}) = +1$ because ...
• symmetry $\Leftrightarrow +1$
• (black or white count = 3) or (black count = 4 and middle-top black) $\Leftrightarrow +1$

truth $f(\mathbf{x}) = -1$ because ...
• left-top black $\Leftrightarrow -1$
• middle column contains at most 1 black and right-top white $\Leftrightarrow -1$

all valid reasons, your adversarial teacher
can always call you ‘didn’t learn’. :-(
A ‘Simple’ Binary Classification Problem

x_n    y_n = f(x_n)
000    ∘
001    ×
010    ×
011    ∘
100    ×

• $\mathcal{X} = \{0, 1\}^3$, $\mathcal{Y} = \{\circ, \times\}$, can enumerate all candidate $f$ as $\mathcal{H}$

pick $g \in \mathcal{H}$ with all $g(\mathbf{x}_n) = y_n$ (like PLA),
does $g \approx f$?
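The enumeration mentioned above is small enough to carry out directly. The following is a minimal Python sketch, assuming ∘ is encoded as 0 and × as 1 (an encoding chosen here for illustration, not taken from the slides); it lists every candidate $f$ on $\{0, 1\}^3$ and counts those that agree with all five examples.

```python
from itertools import product

# All 8 inputs in X = {0,1}^3, in lexicographic order: 000, 001, ..., 111.
X = list(product([0, 1], repeat=3))

# The five known examples; encoding (an assumption): 0 stands for 'o', 1 for 'x'.
D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}

# Every candidate f assigns one of two labels to each of the 8 inputs: 2^8 = 256.
H = [dict(zip(X, labels)) for labels in product([0, 1], repeat=len(X))]

# Candidates that reproduce every label in D.
consistent = [f for f in H if all(f[x] == y for x, y in D.items())]

print(len(H), len(consistent))  # 256 8
```

Eight different candidates fit the data perfectly, so the five examples alone cannot single out $f$.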





No Free Lunch

       x      y    g    f1   f2   f3   f4   f5   f6   f7   f8
D      000    ∘    ∘    ∘    ∘    ∘    ∘    ∘    ∘    ∘    ∘
D      001    ×    ×    ×    ×    ×    ×    ×    ×    ×    ×
D      010    ×    ×    ×    ×    ×    ×    ×    ×    ×    ×
D      011    ∘    ∘    ∘    ∘    ∘    ∘    ∘    ∘    ∘    ∘
D      100    ×    ×    ×    ×    ×    ×    ×    ×    ×    ×
       101         ?    ∘    ∘    ∘    ∘    ×    ×    ×    ×
       110         ?    ∘    ∘    ×    ×    ∘    ∘    ×    ×
       111         ?    ∘    ×    ∘    ×    ∘    ×    ∘    ×

• $g \approx f$ inside $\mathcal{D}$: sure!
• $g \approx f$ outside $\mathcal{D}$: No! (but that’s really what we want!)

learning from $\mathcal{D}$ (to infer something outside $\mathcal{D}$)
is doomed if any ‘unknown’ $f$ can happen. :-(
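One way to see why no choice of $g$ helps outside $\mathcal{D}$ is to average over the eight targets that remain possible. The sketch below is only an illustration: it assumes the same 0/1 encoding of ∘/× and assumes every one of the eight targets is equally plausible, which is a modeling choice made here rather than a statement from the slides.

```python
from itertools import product

# The three inputs outside D.
OUTSIDE = [(1, 0, 1), (1, 1, 0), (1, 1, 1)]

# The 8 possible targets f_1..f_8 differ only in their labels on OUTSIDE
# (0 stands for 'o', 1 for 'x'; this encoding is an assumption for illustration).
targets = list(product([0, 1], repeat=len(OUTSIDE)))

def average_agreement(g_outside):
    """Mean number of OUTSIDE points on which g matches f, over all 8 targets."""
    total = sum(
        sum(g == f for g, f in zip(g_outside, f_outside))
        for f_outside in targets
    )
    return total / len(targets)

# Whatever g predicts outside D, its average agreement is the same.
scores = {g: average_agreement(g) for g in targets}
print(set(scores.values()))  # {1.5}
```

Every possible $g$ is right on 1.5 of the 3 unseen points on average, which is no better than guessing.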
Fun Time
This is a popular ‘brain-storming’ problem, with a claim that 2%
of the world’s cleverest population can crack its ‘hidden pattern’.

$(5, 3, 2) \rightarrow 151022$, $(7, 2, 5) \rightarrow\ ?$

It is like a ‘learning problem’ with $N = 1$, $\mathbf{x}_1 = (5, 3, 2)$, $y_1 = 151022$.
Learn a hypothesis from the one example to predict on $\mathbf{x} = (7, 2, 5)$.
What is your answer?

(1) 151026
(2) 143547
(3) I need more examples to get the correct answer
(4) there is no ‘correct’ answer

Reference Answer: (4)
Following the same nature of the no-free-lunch problems discussed,
we cannot hope to be correct under this ‘adversarial’ setting. BTW,
(2) is the designer’s answer: the first two digits = $x_1 \cdot x_2$; the next two
digits = $x_1 \cdot x_3$; the last two digits = $(x_1 \cdot x_2 + x_1 \cdot x_3 - x_2)$.
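The designer’s rule can be checked mechanically, and so can a rival rule that fits the single example equally well. In the sketch below, `designer_rule` follows the reference answer, while `alternative_rule` is a made-up hypothesis introduced here purely to illustrate the point; it is not part of the original puzzle.

```python
def designer_rule(x):
    """The designer's pattern from the reference answer: three two-digit chunks."""
    x1, x2, x3 = x
    a = x1 * x2                 # first two digits
    b = x1 * x3                 # next two digits
    c = x1 * x2 + x1 * x3 - x2  # last two digits
    return int(f"{a:02d}{b:02d}{c:02d}")

def alternative_rule(x):
    """A made-up rival hypothesis (not from the slides) that also fits the example."""
    x1, _, _ = x
    return 151022 + 2 * (x1 - 5)

example = (5, 3, 2)
query = (7, 2, 5)
print(designer_rule(example), alternative_rule(example))  # 151022 151022
print(designer_rule(query), alternative_rule(query))      # 143547 151026
```

Both rules agree on the one training example and disagree on the query, so the example cannot tell them apart.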
Probability to the Rescue

Inferring Something Unknown
difficult to infer unknown target $f$ outside $\mathcal{D}$ in learning;
can we infer something unknown in other scenarios?

• consider a bin of many many orange and green marbles
• do we know the orange portion (probability)? No!

can you infer the orange probability?
Statistics 101: Inferring Orange Probability

bin
assume
orange probability = $\mu$,
green probability = $1 - \mu$,
with $\mu$ unknown

sample
$N$ marbles sampled independently, with
orange fraction = $\nu$,
green fraction = $1 - \nu$,
now $\nu$ known

does in-sample $\nu$ say anything about
out-of-sample $\mu$?


Possible versus Probable
does in-sample $\nu$ say anything about out-of-sample $\mu$?

possibly not: sample can be mostly
green while bin is mostly orange

probably yes: in-sample $\nu$ likely close
to unknown $\mu$

formally, what does $\nu$ say about $\mu$?
Hoeffding’s Inequality (1/2)

bin: $\mu$ = orange probability in bin
sample of size $N$: $\nu$ = orange fraction in sample

• in big sample ($N$ large), $\nu$ is probably close to $\mu$ (within $\epsilon$)

  $\mathbb{P}\left[|\nu - \mu| > \epsilon\right] \le 2 \exp\left(-2 \epsilon^2 N\right)$

• called Hoeffding’s Inequality, for marbles, coin, polling, ...

the statement ‘$\nu = \mu$’ is
probably approximately correct (PAC)
Hoeffding’s Inequality (2/2)

$\mathbb{P}\left[|\nu - \mu| > \epsilon\right] \le 2 \exp\left(-2 \epsilon^2 N\right)$

• valid for all $N$ and $\epsilon$
• does not depend on $\mu$, no need to ‘know’ $\mu$
• larger sample size $N$ or looser gap $\epsilon$
  $\Longrightarrow$ higher probability for ‘$\nu \approx \mu$’

if large $N$, can probably infer
unknown $\mu$ by known $\nu$
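The inequality can be probed with a small simulation. The sketch below is only an illustration under assumed settings ($\mu = 0.4$, $\epsilon = 0.1$, a fixed random seed, all chosen here); it draws many samples of size $N$ and compares the observed frequency of $|\nu - \mu| > \epsilon$ with the bound. The helper names `hoeffding_bound` and `violation_rate` are hypothetical.

```python
import math
import random

def hoeffding_bound(eps, n):
    """Right-hand side of Hoeffding's Inequality: 2 exp(-2 eps^2 N)."""
    return 2.0 * math.exp(-2.0 * eps ** 2 * n)

def violation_rate(mu, n, eps, trials=2000, seed=0):
    """Fraction of simulated size-n samples with |nu - mu| > eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # Each marble is orange independently with probability mu.
        oranges = sum(rng.random() < mu for _ in range(n))
        if abs(oranges / n - mu) > eps:
            bad += 1
    return bad / trials

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, violation_rate(0.4, n, 0.1), hoeffding_bound(0.1, n))
```

For small $N$ the bound can exceed 1 and says nothing; as $N$ grows, both the observed rate and the bound shrink, with the bound staying above the observed rate.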
Fun Time
Let $\mu = 0.4$. Use Hoeffding’s Inequality

$\mathbb{P}\left[|\nu - \mu| > \epsilon\right] \le 2 \exp\left(-2 \epsilon^2 N\right)$

to bound the probability that a sample of 10 marbles will have
$\nu \le 0.1$. What bound do you get?

(1) 0.67
(2) 0.40
(3) 0.33
(4) 0.05

Reference Answer: (3)
Set $N = 10$ and $\epsilon = 0.3$ and you get the
answer. BTW, (4) is the actual probability and
Hoeffding gives only an upper bound to that.
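Both numbers in the reference answer can be reproduced with a few lines of arithmetic. This sketch assumes the marbles are drawn independently, so the number of orange marbles follows a binomial distribution, and reads $\nu \le 0.1$ as ‘at most 1 orange marble out of 10’.

```python
import math

N, mu, eps = 10, 0.4, 0.3

# Hoeffding's upper bound on P[|nu - mu| > eps].
bound = 2 * math.exp(-2 * eps ** 2 * N)

# Exact probability that nu <= 0.1, i.e. at most 1 orange marble among 10,
# when each marble is orange independently with probability mu.
exact = sum(math.comb(N, k) * mu ** k * (1 - mu) ** (N - k) for k in range(2))

print(round(bound, 2), round(exact, 2))  # 0.33 0.05
```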
Connection to Learning

bin
• unknown orange prob. $\mu$
• marble $\in$ bin
• orange marble
• green marble
• size-$N$ sample from bin of i.i.d. marbles

learning
• fixed hypothesis $h(\mathbf{x}) \stackrel{?}{=}$ target $f(\mathbf{x})$
• $\mathbf{x} \in \mathcal{X}$
• $h$ is wrong $\Leftrightarrow h(\mathbf{x}) \neq f(\mathbf{x})$
• $h$ is right $\Leftrightarrow h(\mathbf{x}) = f(\mathbf{x})$
• check $h$ on $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}$ with i.i.d. $\mathbf{x}_n$, where $y_n = f(\mathbf{x}_n)$

if large $N$ & i.i.d. $\mathbf{x}_n$, can probably infer
unknown $[\![h(\mathbf{x}) \neq f(\mathbf{x})]\!]$ probability
by known $[\![h(\mathbf{x}_n) \neq y_n]\!]$ fraction





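Added Components

[learning-flow diagram]
• unknown target function $f: \mathcal{X} \rightarrow \mathcal{Y}$ (ideal credit approval formula)
• unknown $P$ on $\mathcal{X}$
• training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ (historical records in bank)
• learning algorithm $\mathcal{A}$
• hypothesis set $\mathcal{H}$ (set of candidate formula), with one fixed $h$: $h \approx f$?
• final hypothesis $g \approx f$ (‘learned’ formula to be used)

for any fixed $h$, can probably infer
unknown $E_{\text{out}}(h) = \mathop{\mathcal{E}}_{\mathbf{x} \sim P} [\![h(\mathbf{x}) \neq f(\mathbf{x})]\!]$
by known $E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} [\![h(\mathbf{x}_n) \neq y_n]\!]$.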
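The two error measures can be computed side by side on a toy problem. Everything specific in this sketch is an assumption made for illustration: the target `f` (majority vote of three bits), the fixed hypothesis `h` (copy the first bit), and $P$ uniform on $\{0, 1\}^3$; none of these come from the slides.

```python
import random
from itertools import product

def f(x):
    """Hypothetical target (an assumption for illustration): majority vote of the bits."""
    return int(sum(x) >= 2)

def h(x):
    """One fixed hypothesis: copy the first bit."""
    return x[0]

X = list(product([0, 1], repeat=3))

# E_out(h): probability of h(x) != f(x) when x ~ P, with P uniform on X (assumed).
e_out = sum(h(x) != f(x) for x in X) / len(X)

def e_in(n, rng):
    """E_in(h): fraction of errors on n i.i.d. examples (x_n, y_n = f(x_n))."""
    sample = [rng.choice(X) for _ in range(n)]
    return sum(h(x) != f(x) for x in sample) / n

rng = random.Random(0)
print(e_out)  # 0.25
for n in (10, 100, 10000):
    print(n, e_in(n, rng))
```

In a real problem only `e_in` is computable, since `e_out` needs the unknown $f$ and $P$; the toy setting exposes both so the two can be compared as $N$ grows.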





The Formal Guarantee
for any fixed $h$, in ‘big’ data ($N$ large),
in-sample error $E_{\text{in}}(h)$ is probably close to
out-of-sample error $E_{\text{out}}(h)$ (within $\epsilon$)

$\mathbb{P}\left[|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\right] \le 2 \exp\left(-2 \epsilon^2 N\right)$

same as the ‘bin’ analogy ...
• valid for all $N$ and $\epsilon$
• does not depend on $E_{\text{out}}(h)$, no need to ‘know’ $E_{\text{out}}(h)$,
  so $f$ and $P$ can stay unknown
• ‘$E_{\text{in}}(h) = E_{\text{out}}(h)$’ is probably approximately correct (PAC)

if ‘$E_{\text{in}}(h) \approx E_{\text{out}}(h)$’ and ‘$E_{\text{in}}(h)$ small’
$\Longrightarrow$ $E_{\text{out}}(h)$ small $\Longrightarrow$ $h \approx f$ with respect to $P$
Verification of One h
for any fixed $h$, when data large enough,
$E_{\text{in}}(h) \approx E_{\text{out}}(h)$
Can we claim ‘good learning’ ($g \approx f$)?

Yes:
if $E_{\text{in}}(h)$ small for the fixed $h$
and $\mathcal{A}$ picks the $h$ as $g$
$\Longrightarrow$ ‘$g = f$’ PAC

No:
if $\mathcal{A}$ forced to pick THE $h$ as $g$
$\Longrightarrow$ $E_{\text{in}}(h)$ almost always not small
$\Longrightarrow$ ‘$g \neq f$’ PAC

real learning:
$\mathcal{A}$ shall make choices $\in \mathcal{H}$ (like PLA)
rather than being forced to pick one $h$. :-(
The ‘Verification’ Flow

[verification-flow diagram]
• unknown target function $f: \mathcal{X} \rightarrow \mathcal{Y}$ (ideal credit approval formula)
• unknown $P$ on $\mathcal{X}$
• verifying examples $\mathcal{D}: (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ (historical records in bank)
• one hypothesis $h$ (one candidate formula)
• final hypothesis $g \approx f$ (given formula to be verified)

can now use ‘historical records’ (data) to
verify ‘one candidate formula’ $h$


Fun Time
Your friend tells you her secret rule in investing in a particular stock:
‘Whenever the stock goes down in the morning, it will go up in the afternoon;
vice versa.’ To verify the rule, you chose 100 days uniformly at random
from the past 10 years of stock data, and found that 80 of them satisfy
the rule. What is the best guarantee that you can get from the verification?

(1) You’ll definitely be rich by exploiting the rule in the next 100 days.
(2) You’ll likely be rich by exploiting the rule in the next 100 days, if the
    market behaves similarly to the last 10 years.
(3) You’ll likely be rich by exploiting the ‘best rule’ from 20 more
    friends in the next 100 days.
(4) You’d definitely have been rich if you had exploited the rule in the
    past 10 years.

Reference Answer: (2)
(1): no free lunch; (3): no ‘learning’ guarantee in verification; (4): verifying
with only 100 days, possible that the rule is mostly wrong for whole 10 years.
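The strength of the ‘likely’ in answer (2) can be quantified with Hoeffding’s Inequality. The sketch below solves $2 \exp(-2 \epsilon^2 N) = \delta$ for $\epsilon$; the confidence level $\delta = 0.05$ and the helper name `hoeffding_radius` are assumptions introduced here, not part of the slides.

```python
import math

def hoeffding_radius(n, delta):
    """eps such that 2 exp(-2 eps^2 n) = delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

n, nu = 100, 80 / 100       # 80 of the 100 sampled days satisfy the rule
eps = hoeffding_radius(n, delta=0.05)
print(round(eps, 3))        # 0.136
print(round(nu - eps, 3))   # 0.664
```

Under these assumptions, with probability at least 95% over the random choice of days, the rule holds on at least about 66% of the days in the 10-year record. The statement covers only that record, which is why answer (2) needs the market to keep behaving similarly.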


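Connection to Real Learning

Multiple h

[figure: bins $h_1, h_2, \ldots, h_M$ with out-of-sample errors $E_{\text{out}}(h_1), E_{\text{out}}(h_2), \ldots, E_{\text{out}}(h_M)$, and a sample from each bin with in-sample errors $E_{\text{in}}(h_1), E_{\text{in}}(h_2), \ldots, E_{\text{in}}(h_M)$]

real learning (say like PLA):
BINGO when getting [a sample of all-green marbles]?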
Coin Game

Q: if everyone in a size-150 NTU ML class flips a coin 5 times, and one
of the students gets 5 heads for her coin ‘g’, is ‘g’ really magical?
A: No. Even if all coins are fair, the probability that at least one of the coins
results in 5 heads is $1 - \left(\frac{31}{32}\right)^{150} > 99\%$.

BAD sample: $E_{\text{in}}$ and $E_{\text{out}}$ far away;
can get worse when involving ‘choice’
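The probability quoted in the answer follows from independence across the 150 coins and can be checked numerically. A minimal sketch, assuming fair and independent coins as stated in the answer:

```python
students, flips = 150, 5

# Probability that one fair coin shows heads on all 5 flips.
p_all_heads = 0.5 ** flips                     # 1/32

# Probability that at least one of the 150 independent coins does so.
p_some_all_heads = 1 - (1 - p_all_heads) ** students

print(round(p_some_all_heads, 3))  # 0.991
```

A single coin rarely gives 5 heads (about 3%), yet picking the best of 150 coins after the fact almost always finds one.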





Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations


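BAD Sample and BAD Data

BAD Sample
e.g., $E_{\text{out}} = \frac{1}{2}$, but getting all heads ($E_{\text{in}} = 0$)!

BAD Data for One h
$E_{\text{out}}(h)$ and $E_{\text{in}}(h)$ far away:
e.g., $E_{\text{out}}$ big (far from $f$), but $E_{\text{in}}$ small (correct on most examples)

[table: possible data sets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_{1126}, \ldots, \mathcal{D}_{5678}, \ldots$ as columns, one row for $h$ marking the BAD ones; Hoeffding: $\mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h] \le \ldots$]

Hoeffding: small

$\mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D}] = \sum_{\text{all possible } \mathcal{D}} \mathbb{P}(\mathcal{D}) \cdot [\![\text{BAD } \mathcal{D}]\!]$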
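For a tiny case the sum over all possible data sets can be evaluated exactly. The sketch below makes several assumptions for illustration: a data set is summarized by which of its $N$ examples $h$ gets wrong, each example is wrong independently with probability $E_{\text{out}}(h)$, and the values $E_{\text{out}}(h) = 0.5$, $N = 10$, $\epsilon = 0.25$ are chosen here, not taken from the slides.

```python
import math
from itertools import product

def prob_bad(e_out, n, eps):
    """Sum P(D) * [BAD D] over every length-n pattern of right/wrong outcomes."""
    total = 0.0
    for outcome in product([0, 1], repeat=n):   # 1 marks an example where h errs
        k = sum(outcome)
        p_d = e_out ** k * (1 - e_out) ** (n - k)
        if abs(k / n - e_out) > eps:            # BAD: E_in far from E_out
            total += p_d
    return total

n, eps = 10, 0.25
exact = prob_bad(0.5, n, eps)
bound = 2 * math.exp(-2 * eps ** 2 * n)
print(exact, round(bound, 3))  # 0.109375 0.573
```

The exact probability of BAD data is well under the Hoeffding bound, as expected, since the bound holds without knowing $E_{\text{out}}(h)$.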
BAD Data for Many h

BAD data for many $h$
$\Longleftrightarrow$ no ‘freedom of choice’ by $\mathcal{A}$
$\Longleftrightarrow$ there exists some $h$ such that $E_{\text{out}}(h)$ and $E_{\text{in}}(h)$ far away

        D_1    D_2    ...    D_1126    ...    D_5678    Hoeffding
h_1     BAD                                   BAD       P_D[BAD D for h_1] ≤ ...
h_2            BAD                                      P_D[BAD D for h_2] ≤ ...
h_3     BAD    BAD                            BAD       P_D[BAD D for h_3] ≤ ...
...
h_M     BAD                                   BAD       P_D[BAD D for h_M] ≤ ...
all     BAD    BAD                            BAD       ?

for $M$ hypotheses, bound of $\mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D}]$?
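Bound of BAD Data

$$
\begin{aligned}
\mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D}]
&= \mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h_1 \text{ or BAD } \mathcal{D} \text{ for } h_2 \text{ or } \ldots \text{ or BAD } \mathcal{D} \text{ for } h_M] \\
&\le \mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h_1] + \mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h_2] + \ldots + \mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h_M] \quad \text{(union bound)} \\
&\le 2 \exp\left(-2 \epsilon^2 N\right) + 2 \exp\left(-2 \epsilon^2 N\right) + \ldots + 2 \exp\left(-2 \epsilon^2 N\right) \\
&= 2 M \exp\left(-2 \epsilon^2 N\right)
\end{aligned}
$$

• finite-bin version of Hoeffding, valid for all $M$, $N$ and $\epsilon$
• does not depend on any $E_{\text{out}}(h_m)$, no need to ‘know’ $E_{\text{out}}(h_m)$,
  so $f$ and $P$ can stay unknown
• ‘$E_{\text{in}}(g) = E_{\text{out}}(g)$’ is PAC, regardless of $\mathcal{A}$

‘most reasonable’ $\mathcal{A}$ (like PLA/pocket):
pick the $h_m$ with lowest $E_{\text{in}}(h_m)$ as $g$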
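The finite-bin bound also shows how the required sample size grows with $M$. The sketch below solves $2 M \exp(-2 \epsilon^2 N) \le \delta$ for $N$; the tolerance $\delta$ and the function names are assumptions introduced here for illustration.

```python
import math

def union_bound(m, eps, n):
    """Finite-bin Hoeffding bound: 2 M exp(-2 eps^2 N)."""
    return 2 * m * math.exp(-2 * eps ** 2 * n)

def samples_needed(m, eps, delta):
    """Smallest N for which the bound is at most delta."""
    return math.ceil(math.log(2 * m / delta) / (2 * eps ** 2))

for m in (1, 100, 10000):
    print(m, samples_needed(m, eps=0.1, delta=0.05))
# 1 185
# 100 415
# 10000 645
```

The required $N$ grows only logarithmically in $M$: going from 1 to 10000 hypotheses raises it from 185 to 645 in this setting.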
The ‘Statistical’ Learning Flow
if $|\mathcal{H}| = M$ finite, $N$ large enough,
for whatever $g$ picked by $\mathcal{A}$, $E_{\text{out}}(g) \approx E_{\text{in}}(g)$
if $\mathcal{A}$ finds one $g$ with $E_{\text{in}}(g) \approx 0$,
PAC guarantee for $E_{\text{out}}(g) \approx 0$ $\Longrightarrow$ learning possible :-)

[learning-flow diagram]
• unknown target function $f: \mathcal{X} \rightarrow \mathcal{Y}$ (ideal credit approval formula)
• unknown $P$ on $\mathcal{X}$
• training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ (historical records in bank)
• learning algorithm $\mathcal{A}$
• hypothesis set $\mathcal{H}$ (set of candidate formula)
• final hypothesis $g \approx f$ (‘learned’ formula to be used)

$M = \infty$? (like perceptrons)
see you in the next lectures
Fun Time
Consider 4 hypotheses.

$h_1(\mathbf{x}) = \text{sign}(x_1)$, $h_2(\mathbf{x}) = \text{sign}(x_2)$,
$h_3(\mathbf{x}) = \text{sign}(-x_1)$, $h_4(\mathbf{x}) = \text{sign}(-x_2)$.

For any $N$ and $\epsilon$, which of the following statements is not true?

(1) the BAD data of $h_1$ and the BAD data of $h_2$ are exactly the same
(2) the BAD data of $h_1$ and the BAD data of $h_3$ are exactly the same
(3) $\mathbb{P}_{\mathcal{D}}[\text{BAD for some } h_k] \le 8 \exp\left(-2 \epsilon^2 N\right)$
(4) $\mathbb{P}_{\mathcal{D}}[\text{BAD for some } h_k] \le 4 \exp\left(-2 \epsilon^2 N\right)$

Reference Answer: (1)
The important thing is to note that (2) is true,
which implies that (4) is true if you revisit the
union bound. Similar ideas will be used to
conquer the $M = \infty$ case.
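Statement (2) can be checked empirically: since $h_3$ answers the opposite of $h_1$, both of its errors are one minus those of $h_1$, so the two gaps $|E_{\text{in}} - E_{\text{out}}|$ coincide on every data set. The sketch below is an illustration under assumptions made here: a hypothetical target $f(\mathbf{x}) = \text{sign}(x_1 + x_2)$ and $P$ uniform on $[-1, 1]^2$.

```python
import random

def sign(v):
    return 1 if v > 0 else -1

def f(x):
    """Hypothetical target, chosen only for illustration (not from the slides)."""
    return sign(x[0] + x[1])

def h1(x):
    return sign(x[0])

def h3(x):
    return sign(-x[0])

def e_in(h, data):
    """Fraction of examples in data on which h disagrees with the label."""
    return sum(h(x) != y for x, y in data) / len(data)

# Assuming P is uniform on [-1, 1]^2, E_out(h1) = 1/4 for this f, and
# E_out(h3) = 1 - E_out(h1) = 3/4 because h3 answers the opposite of h1
# whenever x_1 != 0.
E_OUT_H1, E_OUT_H3 = 0.25, 0.75

def largest_gap_difference(trials=1000, n=20, seed=0):
    """Largest difference between the h1 gap and the h3 gap over random data sets."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
        data = [(x, f(x)) for x in xs]
        gap1 = abs(e_in(h1, data) - E_OUT_H1)
        gap3 = abs(e_in(h3, data) - E_OUT_H3)
        worst = max(worst, abs(gap1 - gap3))
    return worst

print(largest_gap_difference() < 1e-12)  # True
```

Because the gaps match on every data set, the BAD events of $h_1$ and $h_3$ overlap completely, and the union bound needs to count them only once; the same holds for $h_2$ and $h_4$.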
Summary
1. When Can Machines Learn?

   Lecture 3: Types of Learning
   Lecture 4: Feasibility of Learning
   • Learning is Impossible?
     absolutely no free lunch outside $\mathcal{D}$
   • Probability to the Rescue
     probably approximately correct outside $\mathcal{D}$
   • Connection to Learning
     verification possible if $E_{\text{in}}(h)$ small for fixed $h$
   • Connection to Real Learning
     learning possible if $|\mathcal{H}|$ finite and $E_{\text{in}}(g)$ small

2. Why Can Machines Learn?
   • next: what if $|\mathcal{H}| = \infty$?
3. How Can Machines Learn?
4. How Can Machines Learn Better?




